Abstract

We compare different selection criteria to choose the number of latent states of
a multivariate latent Markov model for longitudinal data. This model is based on
an underlying Markov chain to represent the evolution of a latent characteristic of a
group of individuals over time. Then, the response variables observed at the different
occasions are assumed to be conditionally independent given this chain. Maximum
likelihood estimation of the model is carried out through an Expectation-Maximization
algorithm based on forward-backward recursions, which are well known in the hidden
Markov literature for time series. The selection criteria we consider in our
comparison are based on penalized versions of the maximum log-likelihood or on
the posterior probabilities of belonging to each latent state, that is, the conditional
probability of the latent state given the observed data. A Monte Carlo simulation
study shows that the log-likelihood based information criteria generally perform
better than the classification based criteria. This is due to the fact that the latter
tend to underestimate the true number of latent states, especially in the univariate case.

Keywords: Akaike Information Criterion, Bayesian Information Criterion, entropy,
mixture model, multivariate latent Markov model, Normalized Entropy Criterion.

1 Introduction 

A crucial element in the literature about the wide class of mixture models (McLachlan
and Peel, 2000) is the choice of the number of mixture components, which
is a specific aspect of the model selection process. For instance, this
issue arises in latent class (LC) models, for the choice of the number of latent classes,
in hidden Markov (HM) models for time series and stochastic processes
(Zucchini and MacDonald, 2009), and in latent Markov (LM) models (Wiggins, 1973)
for longitudinal data. This last class of models is typically used when the interest is in
describing the evolution of a latent characteristic of a group of individuals over time. These
models assume that one or more occasion-specific response variables depend only on a discrete
latent variable, characterized by a given number of latent states, which follows a
first-order Markov process (Bartolucci et al, 2013). The basic idea behind this assumption is
that the latent process fully explains the observable behavior of a subject. Furthermore,
the latent state to which a subject belongs at a certain occasion only depends on the
latent state at the previous occasion. An LM model may also be seen as an extension
of the LC model, in which the assumption that each subject belongs to the same latent
class throughout the period of observation is suitably relaxed.

In such a context the number of latent states is usually selected on the basis of the
observed data, both in the case of the basic LM model and in the advanced versions that,
for example, allow for the inclusion of observable individual covariates. Only in certain
applications is the number of latent states defined a priori by the nature of the problem
or by the interest of the research. However, when states are selected on the basis of the
observed data, increasing the number of states often improves the fit of the model, as
judged by the likelihood, but also increases the number of parameters. The same problem
arises when selecting the number of components in a finite mixture model.

The most common approaches adopted to balance model fit and
parsimony are based on information criteria constructed from indices that are
penalized versions of the maximum log-likelihood. Among these criteria, the most common
are the Akaike Information Criterion (AIC; Akaike, 1973) and the Bayesian Information
Criterion (BIC; Schwarz, 1978). The first one is known as an estimator of the
Kullback-Leibler discrepancy between the model generating the data and the fitted model. BIC
may instead be seen as an asymptotic approximation of the integrated likelihood, which
provides an estimator of a transformation of the Bayesian posterior probability of a
candidate model. Several alternatives to the AIC criterion have been proposed in the literature,
such as AIC3 of Bozdogan (1993) and the Consistent AIC (CAIC) criterion proposed by
Bozdogan (1987), which are based on different penalization terms. It is important to
mention that the information criteria are preferred to methods based on the likelihood ratio
test between nested models because the latter require a bootstrap resampling procedure.

In addition to the above log-likelihood based information criteria, classification based
criteria have been proposed in the literature, which allow us to measure the quality of the
classification provided by a model. The Normalized Entropy Criterion (NEC) is an approach
first developed by Celeux and Soromenho (1996) to select the number of components in
the context of mixture models. It is based on an entropy term computed on the basis
of the posterior probabilities for every sample unit and mixture component. This
criterion takes into account the quality of the classification, that is, how well the clusters
are separated, in addition to the goodness-of-fit of the model, which is measured in terms of
log-likelihood. An entropy index has recently been proposed in the HM literature to measure
the uncertainty involved in finding the most likely sequence of the latent
states; see Hernando et al (2005) and Durand and Guedon (2012). The entropy measure,
however, has not been investigated as a tool for states selection in such a context. Among
other classification based criteria, it is also worth mentioning the Classification Likelihood
Information Criterion (CLC), adopted by Biernacki and Govaert (1997) in the mixture
context, and an approximation of the Integrated Classification Likelihood criterion (ICL;
Biernacki et al, 1998) using BIC, denoted as ICL-BIC and first adopted by Biernacki et al
(2000) and McLachlan and Peel (2000).

In the context of finite mixture models and, in particular, of LC models, several
studies exist aimed at comparing the performance of the above mentioned criteria. Among
others, Fraley and Raftery (2002) used BIC for clustering in mixture models, showing
its satisfactory behavior (see also McLachlan and Peel, 2000, Ch. 8). Simulation studies
have also been performed by Nylund et al (2006) for growth mixture and LC models,
and by Biernacki and Govaert (1999) for Gaussian mixtures. In both situations it was
found that BIC outperforms the other information criteria. We also refer to Dias (2006)
for a study of the LC model with binary response variables, from which it emerges that
AIC3 is the best criterion for selecting the number of latent classes. Moreover, CAIC has
been shown to perform similarly to BIC (Lin and Dayton, 1997).
About the behavior of the classification based criteria, we refer to Biernacki and Govaert
(1999), who found that NEC gives poor results in selecting the model under
comparison, although it exhibits good behavior in detecting the number of clusters. Moreover,
Biernacki et al (2000) showed that ICL appears to be more robust than BIC to violation
of some of the mixture model assumptions.

Even if these criteria are widely used in the literature, their performance has not been
studied in enough detail in connection with LM models. A comparison of AIC and BIC
performance in connection with states selection of a univariate LM model may be found
in Bartolucci et al (2013) [Ch. 7]. However, to our knowledge, there are no studies
aimed at comparing the behavior of the different information criteria mentioned above.
On the other hand, their properties have been studied in the context of HM models; see,
among others, Celeux and Durand (2008), Costa and De Angelis (2010) and the references
therein. However, the context is quite different, since HM models are used for time series,
whereas LM models are applied to longitudinal data. The main purpose of this paper
is to compare the performance of all the illustrated information criteria when applied to
select the number of latent states in a multivariate LM model.

We present a Monte Carlo simulation study based on different model specifications,
with respect to the number of response variables, the conditional response
probabilities and the transition probabilities. The aim is to analyze the effect of these factors
on selecting the number of states and to set up a comparison between log-likelihood based
and classification based criteria. In particular, in applying the NEC criterion, we consider
an entropy measure based on the posterior probabilities of all the possible configurations
of latent states, given the observed data, for every sample unit. We also consider two
approximations of NEC which are based on a modified version of the entropy computed
on the basis of the posterior probability of every single latent state at every time occasion.

The article is organized as follows. In the following section we illustrate the multivari- 
ate LM model and we deal with maximum likelihood estimation of the model parameters. 
In Section 3 we illustrate the latent states selection criteria under comparison. In Section 
4 we show the results of a series of simulations made in order to assess the quality of the 
analyzed criteria. In Section 5 we provide main conclusions. 

2 The multivariate LM model 

In the multivariate formulation of the LM model (that includes the univariate one as a
special case) we observe a vector of categorical response variables
$Y^{(t)} = (Y_1^{(t)}, \ldots, Y_r^{(t)})$, for $t = 1, \ldots, T$. Each variable $Y_j^{(t)}$,
$j = 1, \ldots, r$, has $c_j$ categories, labeled from 0 to $c_j - 1$. We denote by
$Y = (Y^{(1)}, \ldots, Y^{(T)})$ the vector of observed responses made of the
union of the vectors $Y^{(t)}$, which usually refers to repeated measurements of the same
variables $Y_j$ ($j = 1, \ldots, r$) on the same individuals at different time points.

The model is based on two main assumptions. Firstly, the vectors $Y^{(t)}$ are
conditionally independent given a latent process $U = (U^{(1)}, \ldots, U^{(T)})$, and the
response variables in each of these vectors are conditionally independent given $U^{(t)}$
at time $t$, with state space $\{1, \ldots, k\}$. In other words, each occasion-specific
observed variable $Y_j^{(t)}$ is independent of
$Y^{(1)}, \ldots, Y^{(t-1)}, Y^{(t+1)}, \ldots, Y^{(T)}$ and of each $Y_h^{(t)}$, for all
$h \neq j = 1, \ldots, r$, given $U^{(t)}$. This is the so
called local independence assumption. Secondly, the latent process $U$ is assumed to
follow a first-order Markov chain with $k$ latent states, that is, each latent variable $U^{(t)}$ is
independent of $U^{(t-2)}, \ldots, U^{(1)}$, given $U^{(t-1)}$. The resulting model is represented by the
path diagram in Figure 1.

Figure 1: Path diagram of the basic latent Markov model for multivariate data



The model is characterized by three different types of parameters:

• the conditional response probabilities

$$\phi^{(t)}_{jy|u} = p(Y_j^{(t)} = y \mid U^{(t)} = u),$$

with $j = 1, \ldots, r$, $t = 1, \ldots, T$, $u = 1, \ldots, k$, and $y = 0, \ldots, c_j - 1$, which,
for a whole response vector $y = (y_1, \ldots, y_r)$, may be collected into

$$\phi^{(t)}_{y|u} = p(Y^{(t)} = y \mid U^{(t)} = u) = \prod_{j=1}^{r} \phi^{(t)}_{j y_j|u};$$

• the initial probabilities

$$\pi_u = p(U^{(1)} = u),$$

with $u = 1, \ldots, k$;

• the transition probabilities

$$\pi^{(t|t-1)}_{u|v} = p(U^{(t)} = u \mid U^{(t-1)} = v),$$

with $t = 2, \ldots, T$ and $u, v = 1, \ldots, k$.

Note that none of these probabilities depends on the specific sample unit. Moreover,
it is possible to include a constraint on the transition probabilities corresponding to the
hypothesis that the Markov chain is time homogeneous. Under this hypothesis, which is
considered in the simulation study illustrated in Section 4, the transition probabilities do
not depend on $t$, so that

$$\pi^{(t|t-1)}_{u|v} = \pi_{u|v}, \quad t = 2, \ldots, T.$$

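The number of free parameters of the multivariate LM model above is given by

$$\#\mathrm{par} = \underbrace{k \sum_{j=1}^{r} (c_j - 1)}_{\phi_{jy|u}} + \underbrace{(k - 1)}_{\pi_u} + \underbrace{k(k - 1)}_{\pi_{u|v}}. \tag{1}$$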



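The probability mass function of the distribution of $U$ may be expressed as

$$p(U = u) = \pi_{u^{(1)}} \prod_{t=2}^{T} \pi^{(t|t-1)}_{u^{(t)}|u^{(t-1)}},$$

and the conditional distribution of $Y$ given $U$ is

$$p(Y = y \mid U = u) = \prod_{t=1}^{T} \phi^{(t)}_{y^{(t)}|u^{(t)}}.$$

Therefore, the manifest distribution $p(Y = y)$ of $Y$ follows

$$p(Y = y) = \sum_{u} \pi_{u^{(1)}}\, \pi^{(2|1)}_{u^{(2)}|u^{(1)}} \cdots \pi^{(T|T-1)}_{u^{(T)}|u^{(T-1)}}\, \phi^{(1)}_{y^{(1)}|u^{(1)}} \cdots \phi^{(T)}_{y^{(T)}|u^{(T)}}.$$

Note that computing $p(Y = y)$ involves all the possible $k^T$ configurations of the vector $u$, which
typically requires a considerable computational effort. In order to compute this
probability efficiently we can use a forward recursion (Baum et al, 1970; Welch, 2003) for obtaining
$q^{(t)}_{u,y} = p(U^{(t)} = u, Y^{(1)} = y^{(1)}, \ldots, Y^{(t)} = y^{(t)})$.
In particular, the $t$-th iteration of this recursion, for $t = 2, \ldots, T$, consists of computing

$$q^{(t)}_{u,y} = \phi^{(t)}_{y^{(t)}|u} \sum_{v=1}^{k} q^{(t-1)}_{v,y}\, \pi^{(t|t-1)}_{u|v}, \quad u = 1, \ldots, k,$$

starting with $q^{(1)}_{u,y} = \pi_u \phi^{(1)}_{y^{(1)}|u}$ for $t = 1$, so that
$p(Y = y) = \sum_{u=1}^{k} q^{(T)}_{u,y}$. This recursion may be easily implemented by using
matrix notation; see Bartolucci et al (2013) for details.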
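As an illustration of the forward recursion described above, the following is a minimal Python sketch, not the authors' implementation. It assumes a time-homogeneous chain and time-constant conditional response probabilities; the function names, the array layout (`Pi[v][u]`, `phi[u][j][c]`) and the numerical values are hypothetical choices made for this example. The recursion is checked against the direct sum over all $k^T$ latent configurations.

```python
from itertools import product


def forward_manifest_prob(pi, Pi, phi, y):
    """p(Y = y) for one response sequence via the forward recursion.

    pi  : initial probabilities, pi[u]
    Pi  : transition matrix, Pi[v][u] = p(U(t) = u | U(t-1) = v)
    phi : phi[u][j][c] = p(Y_j = c | U = u), assumed constant over time
    y   : list of T response vectors, each of length r
    """
    k = len(pi)

    def emission(u, y_t):
        # local independence: product over the r response variables
        p = 1.0
        for j, c in enumerate(y_t):
            p *= phi[u][j][c]
        return p

    # t = 1: q_u = pi_u * phi_{y|u}
    q = [pi[u] * emission(u, y[0]) for u in range(k)]
    # t = 2, ..., T: q_u = phi_{y|u} * sum_v q_v * pi_{u|v}
    for y_t in y[1:]:
        q = [emission(u, y_t) * sum(q[v] * Pi[v][u] for v in range(k))
             for u in range(k)]
    return sum(q)


def brute_force_manifest_prob(pi, Pi, phi, y):
    """Same quantity obtained by summing over all k**T latent configurations."""
    k, T = len(pi), len(y)
    total = 0.0
    for u in product(range(k), repeat=T):
        p = pi[u[0]]
        for t in range(1, T):
            p *= Pi[u[t - 1]][u[t]]
        for t in range(T):
            for j, c in enumerate(y[t]):
                p *= phi[u[t]][j][c]
        total += p
    return total


# Illustrative (assumed) values: k = 2 states, r = 1 binary response, T = 5
pi = [0.5, 0.5]
Pi = [[0.9, 0.1], [0.1, 0.9]]
phi = [[[0.8, 0.2]], [[0.2, 0.8]]]
y = [[0], [0], [1], [1], [0]]

p_forward = forward_manifest_prob(pi, Pi, phi, y)
p_brute = brute_force_manifest_prob(pi, Pi, phi, y)
```

The forward version costs on the order of $T k^2$ operations per response configuration, against $k^T$ terms for the direct sum.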

2.1 Likelihood inference 

In an observed sample of $n$ subjects, let $n(y)$ be the frequency of the observed response
configuration $y$. Assuming independence between the sample units, the model
log-likelihood may be computed as

$$\ell(\theta) = \sum_{y} n(y) \log p(Y = y),$$

where $\theta$ is the vector of all model parameters arranged in a suitable way. The model
log-likelihood may be maximized with respect to $\theta$ by using the Expectation-Maximization
(EM) algorithm of Dempster et al (1977), which represents the main tool to estimate
this class of models. This algorithm is based on the concept of complete data, which
is represented by the pair $(u, y)$, where $u$ denotes a realization of $U$. Therefore, the
complete data log-likelihood is given by

$$\ell^*(\theta) = \sum_{j=1}^{r} \sum_{t=1}^{T} \sum_{u=1}^{k} \sum_{y=0}^{c_j-1} a^{(t)}_{juy} \log \phi^{(t)}_{jy|u}
+ \sum_{u=1}^{k} b^{(1)}_{u} \log \pi_u
+ \sum_{t=2}^{T} \sum_{v=1}^{k} \sum_{u=1}^{k} b^{(t)}_{vu} \log \pi^{(t|t-1)}_{u|v},$$

where $a^{(t)}_{juy}$ corresponds to the frequency of subjects responding by $y$ to the $j$-th response
variable and belonging to latent state $u$ at time $t$, $b^{(1)}_{u}$ is the frequency of subjects in latent
state $u$ at time 1, and $b^{(t)}_{vu}$ corresponds to the frequency of subjects who move from latent
state $v$ to state $u$ at time $t$.

Since the latent configuration of each subject is not known, the EM algorithm maximizes the
log-likelihood above by alternating the following two steps until convergence:

• E-step: compute the expected value of the above frequencies, given the observed
data and the current value of the parameters, so as to obtain the expected value of
$\ell^*(\theta)$;

• M-step: update $\theta$ by maximizing the expected value of $\ell^*(\theta)$ obtained above;
explicit solutions are available for this purpose, see Bartolucci et al (2013).

The E-step of the algorithm involves the computation of the posterior probabilities
$f^{(t)}_{u|y}$ and $f^{(t|t-1)}_{u|v,y}$. Using the following backward recursion

$$\bar{q}^{(t)}_{v,y} = \sum_{u=1}^{k} \bar{q}^{(t+1)}_{u,y}\, \pi^{(t+1|t)}_{u|v}\, \phi^{(t+1)}_{y^{(t+1)}|u}, \quad t = T-1, \ldots, 1,$$

starting with $\bar{q}^{(T)}_{u,y} = 1$, for $t = 1, \ldots, T$, we have

$$f^{(t)}_{u|y} = \frac{q^{(t)}_{u,y}\, \bar{q}^{(t)}_{u,y}}{p(Y = y)}, \quad u = 1, \ldots, k, \tag{2}$$

whereas, for $t = 2, \ldots, T$ and $u, v = 1, \ldots, k$, we have

$$f^{(t|t-1)}_{u|v,y} = \frac{q^{(t-1)}_{v,y}\, \pi^{(t|t-1)}_{u|v}\, \phi^{(t)}_{y^{(t)}|u}\, \bar{q}^{(t)}_{u,y}}{p(Y = y)}. \tag{3}$$

The recursions above may be implemented by using matrix notation, as shown in
Bartolucci (2006) and Bartolucci et al (2007).
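The forward and backward recursions used in the E-step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the paper: the chain is time homogeneous, the input `em[t][u]` is assumed to already contain the probability of the responses observed at occasion t given state u, the second output is computed as the standard two-occasion posterior obtained from the forward and backward quantities, and all names and numbers are hypothetical.

```python
def forward_backward(pi, Pi, em):
    """Posterior state probabilities for one observed response sequence.

    pi : initial probabilities, pi[u]
    Pi : transition matrix, Pi[v][u] = p(U(t) = u | U(t-1) = v)
    em : em[t][u] = probability of the responses at occasion t given state u
    Returns (f, f2, p_y), where f[t][u] is the posterior of state u at
    occasion t, f2[i][v][u] refers to the pair of occasions (i, i + 1)
    (0-based), and p_y is the manifest probability of the sequence.
    """
    T, k = len(em), len(pi)

    # forward recursion
    q = [[pi[u] * em[0][u] for u in range(k)]]
    for t in range(1, T):
        q.append([em[t][u] * sum(q[t - 1][v] * Pi[v][u] for v in range(k))
                  for u in range(k)])

    # backward recursion, starting from 1 at the last occasion
    qb = [[1.0] * k for _ in range(T)]
    for t in range(T - 2, -1, -1):
        qb[t] = [sum(qb[t + 1][u] * Pi[v][u] * em[t + 1][u] for u in range(k))
                 for v in range(k)]

    p_y = sum(q[T - 1])
    f = [[q[t][u] * qb[t][u] / p_y for u in range(k)] for t in range(T)]
    f2 = [[[q[t - 1][v] * Pi[v][u] * em[t][u] * qb[t][u] / p_y
            for u in range(k)] for v in range(k)] for t in range(1, T)]
    return f, f2, p_y


# Illustrative (assumed) values: k = 2 states, one binary response, T = 5
pi = [0.5, 0.5]
Pi = [[0.9, 0.1], [0.1, 0.9]]
Phi = [[0.8, 0.2], [0.2, 0.8]]      # Phi[u][c] = p(Y = c | U = u)
y = [0, 0, 1, 1, 0]
em = [[Phi[u][c] for u in range(2)] for c in y]

f, f2, p_y = forward_backward(pi, Pi, em)
```

A useful sanity check is that each `f[t]` sums to one and that summing `f2` over either state index returns the corresponding single-occasion posterior.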

3 The class of states selection criteria 

As already discussed in Section 1, a crucial point in using LM models concerns the selection
of the number of latent states k. When this number cannot be defined a priori, it is possible
to rely on model selection criteria. In the following we illustrate the most common
log-likelihood based information criteria together with classification based criteria which take
the quality of classification into account.

3.1 Log-likelihood based information criteria 

The information criteria are based on indices that are, essentially, penalized versions of
the maximum log-likelihood. The two most common criteria of this type are AIC and
BIC. The first criterion, proposed by Akaike (1973), is a measure of the relative goodness
of fit of a model, which describes the tradeoff between accuracy and complexity of the
model. In particular, AIC is based on estimating the Kullback-Leibler distance between
the true density and the estimated density, which focuses on the expected log-likelihood,
and is defined on the basis of the following index

$$\mathrm{AIC} = -2\,\hat{\ell}(\theta) + 2\,\#\mathrm{par}. \tag{4}$$

For a given model, $\hat{\ell}$ denotes the maximum of the log-likelihood of the LM model of
interest and #par denotes the number of free parameters as defined in (1). According to
this criterion, the optimal number of latent states is that corresponding to the minimum
value of the index in (4). In practice, we fit the LM model for increasing values of k
until the index starts to increase. Then, we select the previous k as the optimal
number of latent states, which guarantees the best compromise between goodness-of-fit
and model parsimony.

The BIC criterion of Schwarz (1978) is derived, for regular models, as an approximation
to twice the log integrated likelihood (Kass and Raftery, 1995), using the Laplace method
(Tierney and Kadane, 1986). From the asymptotic behavior of this approximation, the
corresponding index may be defined as

$$\mathrm{BIC} = -2\,\hat{\ell}(\theta) + \#\mathrm{par}\,\log(n), \tag{5}$$

with $\hat{\ell}$ and #par defined as above. In certain settings, model selection based on BIC
is roughly equivalent to model selection based on Bayes factors; see among others Kass
and Raftery (1995). The number of latent states k to be selected is the one which
corresponds to the minimum value of the index in (5). Usually, BIC leads to selecting a
smaller number of latent states than the AIC criterion, since it is based on a more
severe penalization. This difference may be relevant in complex models. In particular, the
BIC criterion is expected to perform better as the amount of information increases with
respect to the model complexity. In the LM literature, the same criteria have been used
for model selection by Langeheine (1994), Langeheine and Van de Pol (1994), Magidson
and Vermunt (2001), among many others. Finally, a comparison of their performance in
connection with states selection of a univariate LM model may be found in Bartolucci
et al (2013), Ch. 3. Moreover, comparisons between the AIC and BIC criteria can be found
in the literature on mixture models (McLachlan and Peel, 2000, Ch. 6), and in the HM
literature for time series. From these studies, it emerges that BIC is usually preferable to
AIC, as the latter tends to overestimate the number of states.

Among the variants of the AIC criterion existing in the literature we also consider the
criterion introduced by Bozdogan (1993). In particular, this criterion defines a more
penalized version of the index in (4), on the basis of the results in Wolfe (1970), so as to
obtain

$$\mathrm{AIC3} = -2\,\hat{\ell}(\theta) + 3\,\#\mathrm{par}, \tag{6}$$

in which the penalizing term 2 is substituted with 3.

On the other hand, the Consistent AIC criterion (CAIC), proposed by Bozdogan
(1987), includes a penalizing term which also takes into account the sample size n, and is
defined as

$$\mathrm{CAIC} = -2\,\hat{\ell}(\theta) + \#\mathrm{par}\,(\log(n) + 1). \tag{7}$$

Further to the above information criteria that are aimed at measuring the goodness 
of fit of a model, we also consider criteria that take into account the performance of the 
classification procedure, as outlined in the following. 
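The four indices in (4)-(7) and the selection rule described above (fit the model for increasing k and keep the value just before the first increase of the index) can be sketched as follows. This is a minimal illustration: the function names are hypothetical, and the log-likelihood values and parameter counts in the example are made-up numbers, not results from the simulation study.

```python
import math


def information_criteria(loglik, n_par, n):
    """Penalized log-likelihood indices for one fitted model.

    loglik : maximized log-likelihood
    n_par  : number of free parameters
    n      : sample size
    """
    return {
        "AIC": -2.0 * loglik + 2.0 * n_par,
        "BIC": -2.0 * loglik + n_par * math.log(n),
        "AIC3": -2.0 * loglik + 3.0 * n_par,
        "CAIC": -2.0 * loglik + n_par * (math.log(n) + 1.0),
    }


def select_k(index_by_k):
    """Return the k just before the first increase of the index.

    index_by_k[i] is the value of the index for k = i + 1 latent states.
    If the index never increases, the largest fitted k is returned.
    """
    for i in range(1, len(index_by_k)):
        if index_by_k[i] > index_by_k[i - 1]:
            return i  # 1-based k of the model preceding the increase
    return len(index_by_k)


# Hypothetical maximized log-likelihoods and parameter counts for k = 1..4
logliks = [-820.0, -700.0, -697.0, -695.5]
n_pars = [1, 5, 11, 19]
n = 250

table = [information_criteria(l, p, n) for l, p in zip(logliks, n_pars)]
chosen = {name: select_k([row[name] for row in table])
          for name in ("AIC", "BIC", "AIC3", "CAIC")}
```

In this made-up example all four indices start to increase at k = 3, so each of them selects k = 2.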

3.2 Classification based information criteria 

The criteria developed in the context of the classification likelihood approach, also known
as complete data information criteria, are based on data augmentation, that is, the
complete data as defined in Section 2.1. These criteria consider the following relation, which
was first shown by Hathaway (1986)

$$\ell^*(\theta) = \ell(\theta) - \mathrm{EN}, \tag{8}$$

see also Celeux and Soromenho (1996) and Biernacki and Govaert (1997), where EN is an
entropy measure, which involves the posterior probabilities of component membership of
each subject. Such entropy may be seen as a penalization
term which is a measure of the ability of the model to provide a relevant partition of the
data. More in detail, if the components are well separated, the posterior probabilities
tend to define a partition of the data, assuming values close to 1. As a consequence, the
entropy will be close to 0. The entropy measure cannot be directly used to assess the
number of clusters since $\ell(\theta)$ is an increasing function of k and has to be renormalized.
With reference to mixture models, Celeux and Soromenho (1996) proposed to consider
the NEC criterion, which is expressed by

$$\mathrm{NEC} = \frac{\mathrm{EN}}{\hat{\ell}_k(\theta) - \hat{\ell}_1(\theta)}, \quad k \geq 2, \tag{9}$$

where $\hat{\ell}_k(\theta)$ is the maximum log-likelihood in case of a k components mixture and
$\hat{\ell}_1(\theta)$ is the maximum log-likelihood in case of a 1 component mixture. As also illustrated in
Biernacki and Govaert (1997), NEC must assume small values to obtain a compromise
between a good classification feature and a good description of the data. Then, the
optimal number of components is the one that minimizes the index in (9). It is worth
noting that the NEC criterion is not defined when k = 1; to deal with this problem
Biernacki et al (1999) proposed an empirical version of the NEC for Gaussian mixtures.
The usual convention is to set NEC = 1 for k = 1.
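To make the role of the entropy term concrete in the mixture setting, the following Python sketch computes an entropy of the form EN = -Σ_i Σ_g τ_ig log τ_ig from a matrix of posterior membership probabilities and plugs it into the index in (9), using the convention NEC = 1 for k = 1. The function names, the posterior matrices and the log-likelihood values are hypothetical and serve only as an illustration.

```python
import math


def entropy(posteriors):
    """Entropy of a classification: minus the sum of tau * log(tau).

    posteriors[i][g] is the posterior probability that unit i belongs to
    component g; terms with tau = 0 contribute nothing to the sum.
    """
    return -sum(t * math.log(t) for row in posteriors for t in row if t > 0)


def nec(en, loglik_k, loglik_1, k):
    """Normalized entropy criterion, with the convention NEC = 1 for k = 1."""
    if k == 1:
        return 1.0
    return en / (loglik_k - loglik_1)


# Hypothetical posterior matrices for three units and two components
well_separated = [[0.99, 0.01], [0.02, 0.98], [0.97, 0.03]]
overlapping = [[0.55, 0.45], [0.50, 0.50], [0.40, 0.60]]

en_sep = entropy(well_separated)
en_ovl = entropy(overlapping)

# Hypothetical maximized log-likelihoods for k = 1 and k = 2
nec_sep = nec(en_sep, loglik_k=-700.0, loglik_1=-820.0, k=2)
nec_ovl = nec(en_ovl, loglik_k=-700.0, loglik_1=-820.0, k=2)
```

With the same gain in log-likelihood, the well separated classification yields the smaller entropy and hence the smaller NEC value.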

In extending NEC to LM models, the difficulty is in considering entropy based on the
posterior probabilities of all the possible configurations of latent states, given the observed
data, for every sample unit. Therefore, this "true" entropy can be computed only when
we have a reduced number of time occasions and latent states. In the context of HM
models this measure is defined by Hernando et al (2005) as

$$\mathrm{EN} = -\sum_{u_1} \cdots \sum_{u_T} f_{u_1,\ldots,u_T|y} \log(f_{u_1,\ldots,u_T|y})
= -\sum_{u_1} \cdots \sum_{u_T} f^{(1)}_{u_1|y}\, f^{(2|1)}_{u_2|u_1,y} \cdots f^{(t|t-1)}_{u_t|u_{t-1},y} \cdots f^{(T|T-1)}_{u_T|u_{T-1},y}
\cdot \left[\log(f^{(1)}_{u_1|y}) + \log(f^{(2|1)}_{u_2|u_1,y}) + \cdots + \log(f^{(t|t-1)}_{u_t|u_{t-1},y}) + \cdots + \log(f^{(T|T-1)}_{u_T|u_{T-1},y})\right],$$

where $f^{(t)}_{u|y}$ and $f^{(t|t-1)}_{u|v,y}$ are defined in equations (2) and (3), respectively.

In the context of LM models, we may simplify the above equation for EN by
formulating an approximated version that allows us to compute entropy for any number
of time occasions and latent states. More precisely, under the assumption that the $U^{(t)}$ are
independent given $Y$, we define $\mathrm{EN}_1$ as follows

$$\mathrm{EN}_1 = -\sum_{t=1}^{T} \sum_{u=1}^{k} f^{(t)}_{u|y} \log(f^{(t)}_{u|y}).$$

A possible renormalized variant of $\mathrm{EN}_1$ may also be expressed as

$$\mathrm{EN}_2 = \frac{1}{T}\,\mathrm{EN}_1.$$
Therefore, we consider a NEC criterion relying on the "true" entropy based on the
posterior probabilities of all possible configurations of latent states, and two different
approximated versions, which may be computed for any number of time occasions and latent
states. As an example, suppose we observe subjects at three occasions (T = 3); then
the above criteria may be explicitly written as follows

$$\mathrm{EN} = -\sum_{u} \sum_{v} \sum_{z} f_{u,v,z|y} \log(f_{u,v,z|y})
= -\sum_{u} \sum_{v} \sum_{z} f^{(3|2)}_{z|v,y}\, f^{(2|1)}_{v|u,y}\, f^{(1)}_{u|y}
\cdot \left[\log(f^{(3|2)}_{z|v,y}) + \log(f^{(2|1)}_{v|u,y}) + \log(f^{(1)}_{u|y})\right],$$

$$\mathrm{EN}_1 = -\left[\sum_{u} f^{(1)}_{u|y} \log(f^{(1)}_{u|y}) + \sum_{v} f^{(2)}_{v|y} \log(f^{(2)}_{v|y}) + \sum_{z} f^{(3)}_{z|y} \log(f^{(3)}_{z|y})\right],$$

$$\mathrm{EN}_2 = \frac{1}{3}\,\mathrm{EN}_1.$$

According to the entropy measures defined above, we consider three different versions
of the NEC criterion, where the first one is based on the "true" entropy, as defined in (9),
and the other two versions are expressed as

$$\mathrm{NEC}_1 = \frac{\mathrm{EN}_1}{\hat{\ell}_k(\theta) - \hat{\ell}_1(\theta)}, \quad k \geq 2;$$

$$\mathrm{NEC}_2 = \frac{\mathrm{EN}_2}{\hat{\ell}_k(\theta) - \hat{\ell}_1(\theta)}, \quad k \geq 2.$$
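The difference between the entropy of the full latent configuration and its occasion-wise approximations can be illustrated with a short Python sketch for a single observed sequence. It is only an illustration under stated assumptions: the posterior of each configuration is obtained by brute-force enumeration (feasible only for small T and k), the occasion-wise posteriors are obtained by marginalizing it, the chain is time homogeneous, and all names and numerical values are hypothetical.

```python
import math
from itertools import product


def config_posterior(pi, Pi, em):
    """Posterior of every latent configuration, by brute-force enumeration.

    pi : initial probabilities; Pi[v][u] : transition probabilities;
    em[t][u] : probability of the responses at occasion t given state u.
    """
    T, k = len(em), len(pi)
    joint = {}
    for u in product(range(k), repeat=T):
        p = pi[u[0]] * em[0][u[0]]
        for t in range(1, T):
            p *= Pi[u[t - 1]][u[t]] * em[t][u[t]]
        joint[u] = p
    p_y = sum(joint.values())
    return {u: p / p_y for u, p in joint.items()}


def xlogx(p):
    # convention: 0 * log(0) = 0
    return p * math.log(p) if p > 0 else 0.0


def entropies(pi, Pi, em):
    """Return (EN, EN1, EN2) for one observed sequence."""
    T, k = len(em), len(pi)
    post = config_posterior(pi, Pi, em)
    # entropy of the whole latent configuration
    en = -sum(xlogx(p) for p in post.values())
    # occasion-wise posteriors obtained by marginalization
    marg = [[sum(p for u, p in post.items() if u[t] == s) for s in range(k)]
            for t in range(T)]
    en1 = -sum(xlogx(p) for row in marg for p in row)
    en2 = en1 / T
    return en, en1, en2


# Illustrative (assumed) values: k = 2 states, one binary response, T = 3
pi = [0.5, 0.5]
Pi = [[0.9, 0.1], [0.1, 0.9]]
Phi = [[0.8, 0.2], [0.2, 0.8]]      # Phi[u][c] = p(Y = c | U = u)
y = [0, 1, 1]
em = [[Phi[u][c] for u in range(2)] for c in y]

en, en1, en2 = entropies(pi, Pi, em)
```

Since the entropy of a joint distribution never exceeds the sum of the entropies of its marginals, the value of `en1` is an upper bound for `en`, while `en2` rescales it by the number of occasions.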

Among other criteria which take the quality of classification into account, we also
consider the CLC criterion, proposed by Biernacki and Govaert (1997) in the mixture
context, which uses the relation in (8) to define the following index

$$\mathrm{CLC} = -2\,\hat{\ell}(\theta) + 2\,\mathrm{EN}.$$

Moreover, Biernacki et al (1998) suggested an alternative information criterion based
on the complete data likelihood, named the Integrated Classification Likelihood criterion
(ICL). The same authors also proposed an approximated version of the ICL using BIC
(Biernacki et al, 2000). In particular, McLachlan and Peel (2000) referred to this
approximated version as ICL-BIC and showed that it may be computed as

$$\text{ICL-BIC} = \mathrm{BIC} + 2\,\mathrm{EN},$$

in which the term 2 EN represents a kind of penalization for poorly separated clusters;
see also Li (2005).
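The two indices above differ from the corresponding log-likelihood based ones only by the additive entropy term, as the following minimal Python sketch shows. The function names and the numerical values are hypothetical and used for illustration only.

```python
import math


def clc(loglik, en):
    """Classification likelihood criterion: -2 * loglik + 2 * EN."""
    return -2.0 * loglik + 2.0 * en


def icl_bic(loglik, n_par, n, en):
    """BIC plus the entropy penalization 2 * EN."""
    bic = -2.0 * loglik + n_par * math.log(n)
    return bic + 2.0 * en


# Hypothetical fitted model: log-likelihood, parameter count, sample size, entropy
loglik, n_par, n, en = -700.0, 5, 250, 12.3

clc_value = clc(loglik, en)
icl_bic_value = icl_bic(loglik, n_par, n, en)
```

For a fixed fitted model the two values differ exactly by the BIC penalty #par log(n), so ICL-BIC penalizes both model complexity and poor separation of the latent states.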

As already discussed in Section 1, although the above information criteria are widely
used and their performance has been studied in the context of finite mixture, LC, and HM
models, a comparison in the context of LM models is still lacking in the literature.
In the following section we set up an experimental design to compare the performance
of these criteria in the context of the multivariate LM model, in order to choose the optimal
number of latent states.

4 Simulation study 

We illustrate the results obtained by a Monte Carlo simulation study aimed at comparing 
the performance of the following indices for states selection: 

• Log-likelihood based criteria: AIC, CAIC, AIC3, BIC;

• Classification based criteria: CLC, ICL-BIC, NEC, NEC1, NEC2.

More in detail, we simulate 100 samples of a given size n (n = 250, 500)
from an LM model characterized by r (r = 1, 3, 5) binary (y = 0, 1) response variables
observed at T = 5 time occasions, k (k = 2, 3) latent states, and given values of the initial
probabilities $\pi_u$, transition probabilities $\pi^{(t|t-1)}_{u|v}$, and conditional response probabilities
$\phi^{(t)}_{jy|u}$. All analyses are implemented in R (the code is available upon request from the
authors). In all cases, the strategy adopted to choose the number of latent states is the
same: we fit the LM model with increasing values of k and take as the optimal number of
latent states the value just before the first increase of the criterion index.

We first consider a scenery (scenery 1) based on n = 250 individuals belonging to
k = 2 latent states and observed on T = 5 time occasions. Moreover, we suppose equal
initial probabilities, that is $\pi_1 = \pi_2 = 0.5$, and denoting by $\Pi$ the transition probabilities
matrix with elements $\pi_{u|v}$, under the time homogeneous assumption, we consider

[0.1 0.9]

We also assume alternatively r = 1, 3, 5 binary response variables with the following
matrix $\Phi$ of the conditional response probabilities

$$\Phi = \begin{pmatrix} 0.8 & 0.2 \\ 0.2 & 0.8 \end{pmatrix}.$$
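The data-generating mechanism of such a scenery can be sketched in Python as follows. This is not the R code used for the study: the function name and data layout are hypothetical, the rows of $\Phi$ are assumed to index the latent states and to apply to every response variable, and the first row of the transition matrix used in the example is an assumed value chosen for illustration.

```python
import random


def simulate_lm(n, T, r, pi, Pi, Phi, seed=0):
    """Draw n response sequences from a time-homogeneous LM model.

    pi  : initial probabilities, pi[u]
    Pi  : transition matrix, Pi[v][u] = p(U(t) = u | U(t-1) = v)
    Phi : Phi[u][c] = p(Y_j = c | U = u), assumed equal for all r responses
    Returns (data, states) with data[i][t][j] the response of unit i to
    variable j at occasion t and states[i][t] the latent state.
    """
    rng = random.Random(seed)

    def draw(probs):
        # sample one category index according to the given probabilities
        return rng.choices(range(len(probs)), weights=probs)[0]

    data, states = [], []
    for _ in range(n):
        u = draw(pi)
        us, ys = [], []
        for t in range(T):
            if t > 0:
                u = draw(Pi[u])          # first-order Markov transition
            us.append(u)
            # local independence: r responses drawn independently given u
            ys.append([draw(Phi[u]) for _ in range(r)])
        states.append(us)
        data.append(ys)
    return data, states


pi = [0.5, 0.5]
Pi = [[0.9, 0.1],      # assumed first row, for illustration only
      [0.1, 0.9]]
Phi = [[0.8, 0.2], [0.2, 0.8]]

data, states = simulate_lm(n=250, T=5, r=3, pi=pi, Pi=Pi, Phi=Phi, seed=1)
```

Each simulated sample would then be fitted with k = 1, 2, ... latent states and the criteria compared on the resulting sequence of index values.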

Table 1 shows the relative frequencies of the number k of components chosen by each 
of the considered criteria, both in univariate case (r = 1) and in multivariate cases (r = 3 
and r = 5). 

In the univariate case, all log-likelihood based criteria perform very well, whereas the
classification based criteria perform very poorly: they underestimate k in
almost all cases. In the multivariate cases (r = 3 and r = 5), instead, the classification
based criteria improve considerably, selecting 2 latent states in almost all cases.
We only observe a worsening of AIC, which tends
to overestimate the right number of latent states in 17 and 29 cases out of 100, for r = 3
and for r = 5 respectively.

Table 1: Relative frequencies of k chosen on the basis of several criteria (scenery 1,
n = 250)

 r  k     BIC     AIC    AIC3    CAIC     NEC    NEC1    NEC2     CLC ICL-BIC

 1  1    0.00    0.00    0.00    0.00    1.00    0.95    1.00    1.00    1.00
    2    1.00    0.99    1.00    1.00    0.00    0.05    0.00    0.00    0.00
    3    0.00    0.01    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

 3  1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    2    1.00    0.83    1.00    1.00    0.99    1.00    1.00    0.98    1.00
    3    0.00    0.14    0.00    0.00    0.00    0.00    0.01    0.00    0.00
    4    0.00    0.03    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    5    0.00    0.00    0.00    0.01    0.00    0.00    0.01    0.00    0.00

 5  1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    2    1.00    0.71    0.97    1.00    0.99    1.00    1.00    0.93    1.00
    3    0.00    0.26    0.03    0.00    0.01    0.00    0.00    0.05    0.00
    4    0.00    0.03    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.02    0.00

An alternative scenery (scenery 2) is then considered, which differs from scenery 1 in the
lower values of the state persistence probabilities, given by



All the other elements are the same as those considered in scenery 1. Results are shown
in Table 2.

With respect to scenery 1 we note several differences. In the univariate case, the
behavior of both BIC and CAIC gets worse: BIC selects the true value k = 2 in
less than 50% of cases and CAIC in 37% of cases, whereas in the remaining simulations
Table 2: Relative frequencies of k chosen on the basis of several criteria (scenery 2,
n = 250)

 r  k     BIC     AIC    AIC3    CAIC     NEC    NEC1    NEC2     CLC ICL-BIC

 1  1    0.52    0.00    0.10    0.63    1.00    1.00    0.99    1.00    1.00
    2    0.48    0.98    0.90    0.37    0.00    0.00    0.01    0.00    0.00
    3    0.00    0.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

 3  1    0.00    0.00    0.00    0.00    0.88    0.92    0.00    0.88    0.95
    2    1.00    0.83    0.98    1.00    0.10    0.07    0.96    0.10    0.04
    3    0.00    0.16    0.02    0.00    0.01    0.01    0.04    0.01    0.01
    4    0.00    0.01    0.00    0.00    0.01    0.00    0.00    0.01    0.00
    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

 5  1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    2    1.00    0.77    1.00    1.00    1.00    1.00    1.00    1.00    1.00
    3    0.00    0.15    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    4    0.00    0.06    0.00    0.00    0.00    0.00    0.00    0.00    0.00
    5    0.00    0.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00


both of them underestimate k. A slight worsening is also observed for AIC3, which
chooses only one latent state in 10% of the simulated samples. With r = 3 responses, on
the one hand BIC, AIC3, CAIC and NEC2 considerably improve their performance and, on
the other hand, AIC tends to overestimate k in 17 cases out of 100; in both
situations the values are similar to those of scenery 1. However, the classification based criteria
other than NEC2 improve only slightly and continue to underestimate k in most
cases, behaving very differently from scenery 1. Similarly to scenery
1, with r = 5 all the considered criteria behave optimally (with the exception
of AIC, which overestimates k in 23 cases out of 100).

Another scenery (scenery 3) is then considered, which differs from scenery 1 in the
greater uncertainty in the allocation of the observations to the latent states, the
conditional response probabilities matrix being given by



All the other elements are the same as in scenery 1. Results are shown in Table 3.
Table 3: Relative frequencies of k chosen on the basis of several criteria (scenario 3, n = 250)

k      BIC    AIC    AIC3   CAIC   NEC    NEC1   NEC2   CLC    ICL-BIC
r = 1
1      0.35   0.01   0.02   0.53   1.00   1.00   1.00   1.00   1.00
2      0.65   0.98   0.97   0.47   0.00   0.00   0.00   0.00   0.00
3      0.00   0.01   0.01   0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r = 3
1      0.00   0.00   0.00   0.00   0.99   1.00   0.08   0.99   1.00
2      1.00   0.79   1.00   1.00   0.01   0.00   0.88   0.01   0.00
3      0.00   0.18   0.00   0.00   0.00   0.00   0.03   0.00   0.00
4      0.00   0.03   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00   0.01   0.00   0.00
r = 5
1      0.00   0.00   0.000  0.00   0.285  0.770  0.000  0.285  0.550
2      1.00   0.78   0.995  1.00   0.590  0.220  0.980  0.585  0.445
3      0.00   0.205  0.005  0.00   0.030  0.005  0.015  0.035  0.005
4      0.00   0.01   0.00   0.00   0.070  0.005  0.005  0.070  0.000
5      0.00   0.005  0.00   0.00   0.025  0.000  0.000  0.025  0.000

With respect to scenario 1, with r = 1 response variable the behavior of BIC and CAIC is
not very satisfactory, because they underestimate k in 35% and 53% of cases, respectively.
Concerning the classification based criteria, we note a significant deterioration of their
behavior in both the univariate and the multivariate cases. Only NEC2 performs
satisfactorily, the correct value of k being selected in 88 cases out of 100 with r = 3
and in 98 cases out of 100 with r = 5 (and overestimated in the remaining cases). The
remaining criteria instead lead to choose k = 1 in almost all cases when r = 3, improving
only slightly when r = 5. More precisely, in this last case NEC and CLC select the right
value of k in about 59% of cases, whereas they underestimate k in 28.5% of cases.
Moreover, ICL-BIC leads to select k = 1 for 55% of the simulated models and NEC1 for 77%
of them.
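
The shares of under- and overestimation quoted above follow directly from the columns of Table 3. As a minimal sketch (the helper name `summarize_selection` is hypothetical and not part of the original study), the relative frequencies reported for one criterion can be collapsed into three shares, here for NEC under scenario 3 with r = 5 and true k = 2:

```python
def summarize_selection(freqs, true_k):
    """Collapse the relative frequencies of the selected k into three shares.

    freqs[i] is the relative frequency with which k = i + 1 was selected.
    """
    under = sum(f for k, f in enumerate(freqs, start=1) if k < true_k)
    correct = sum(f for k, f in enumerate(freqs, start=1) if k == true_k)
    over = sum(f for k, f in enumerate(freqs, start=1) if k > true_k)
    return {"under": round(under, 3),
            "correct": round(correct, 3),
            "over": round(over, 3)}


# NEC column of Table 3 for r = 5 (frequencies for k = 1, ..., 5)
nec_r5 = [0.285, 0.590, 0.030, 0.070, 0.025]
print(summarize_selection(nec_r5, true_k=2))
```

The same helper can be applied to any column of the tables in this section to compare criteria on a common scale.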

The three scenarios described above are then replicated by increasing the number of
observations from n = 250 to n = 500, all else being equal. Results are shown in
Tables 4, 5, and 6 for scenarios 1, 2, and 3, respectively.

By increasing the number of observations we note a considerable improvement in the
performance of BIC and CAIC in the univariate cases of scenarios 2 and 3, whereas the
behavior of the other criteria, especially of the classification based criteria, is
unchanged. On the contrary, under scenario 3 with r = 5 response variables, the behavior
of NEC1 and ICL-BIC gets worse: k = 2 is selected in 10% and 43% of cases for n = 500,
against 22% and 44.5% of cases for n = 250.

Table 4: Relative frequencies of k chosen on the basis of several criteria (scenario 1, n = 500)

k      BIC    AIC    AIC3   CAIC   NEC    NEC1   NEC2   CLC    ICL-BIC
r = 1
1      0.00   0.00   0.00   0.00   1.00   1.00   0.98   1.00   1.00
2      1.00   0.99   1.00   1.00   0.00   0.00   0.02   0.00   0.00
3      0.00   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r = 3
1      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
2      1.00   0.87   1.00   1.00   0.99   1.00   1.00   0.99   1.00
3      0.00   0.13   0.00   0.00   0.01   0.00   0.00   0.01   0.00
4      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r = 5
1      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
2      1.00   0.78   1.00   1.00   1.00   1.00   1.00   1.00   1.00
3      0.00   0.17   0.00   0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.05   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00

To conclude, we also consider two further scenarios (scenarios 4 and 5) characterized
by n = 500 individuals and k = 3 latent states. We also assume T = 5, equal initial
probabilities, that is $\pi_1 = \pi_2 = \pi_3 = 1/3$, a matrix of conditional response
probabilities equal to

$$\begin{pmatrix} 0.9 & 0.1 & 0.7 \\ 0.1 & 0.9 & 0.3 \end{pmatrix},$$

and the following transition probabilities (under the time-homogeneity assumption)

$$\Pi = \begin{pmatrix} 0.90 & 0.05 & 0.05 \\ 0.05 & 0.90 & 0.05 \\ 0.05 & 0.05 & 0.90 \end{pmatrix},$$

in the first case (scenario 4) and

$$\Pi = \begin{pmatrix} 0.70 & 0.15 & 0.15 \\ 0.15 & 0.70 & 0.15 \\ 0.15 & 0.15 & 0.70 \end{pmatrix},$$

in the second case (scenario 5).
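
To make the design of these two cases concrete, the following is a minimal sketch of how response histories could be drawn from a latent Markov model with the parameters listed above. It is not the code used in the study: the function name `simulate_lm` is hypothetical, responses are assumed to be binary, and the r responses are assumed to share the same conditional response probabilities and to be conditionally independent given the latent state.

```python
import random


def simulate_lm(n, T, r, init, trans, resp, seed=0):
    """Draw n histories of length T from a latent Markov model.

    init[u]     : initial probability of latent state u
    trans[u][v] : time-homogeneous transition probability from state u to v
    resp[y][u]  : probability of response category y given latent state u
    Assumption: the r responses share the same conditional probabilities and
    are conditionally independent given the latent state.
    """
    rng = random.Random(seed)
    states = range(len(init))
    categories = range(len(resp))
    data = []
    for _ in range(n):
        history = []
        u = rng.choices(states, weights=init)[0]
        for t in range(T):
            if t > 0:  # move along the latent chain after the first occasion
                u = rng.choices(states, weights=trans[u])[0]
            weights = [resp[y][u] for y in categories]
            y_t = rng.choices(categories, weights=weights, k=r)
            history.append((u, y_t))  # latent state kept only for checking
        data.append(history)
    return data


init = [1 / 3, 1 / 3, 1 / 3]
resp = [[0.9, 0.1, 0.7],
        [0.1, 0.9, 0.3]]
trans_4 = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]
trans_5 = [[0.70, 0.15, 0.15], [0.15, 0.70, 0.15], [0.15, 0.15, 0.70]]

sample = simulate_lm(n=500, T=5, r=3, init=init, trans=trans_4, resp=resp)
```

Replacing `trans_4` with `trans_5` gives the case with lower persistence probabilities; in a real analysis only the responses, not the latent states, would be passed to the estimation step.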

Compared with the cases with k = 2 latent states, we now observe a very poor
performance of all criteria in the univariate case (r = 1): log-likelihood based criteria
lead to select k = 2 and classification based criteria lead to select k = 1 in almost all
cases (Tables 7 and 8). With r = 3 response variables, instead, the behavior of the
log-likelihood based criteria improves, especially under scenario 4. Indeed, Table 7 shows
that with AIC and AIC3 the right number of latent states is chosen in 91 and 98 cases
out of 100, respectively, whereas with BIC and CAIC the percentages of right choices
drop to 59% and 39%, respectively, k being underestimated in the remaining cases. On
the other hand, under scenario 5 (Table 8), which refers to a situation with lower
persistence probabilities of the latent states, the improvement is far from clear: k = 3
is selected by AIC in 75% of cases and by AIC3 in 35%, whereas BIC and CAIC regularly
lead to choose k = 2. Finally, with r = 5 responses we observe a satisfactory performance
of the log-likelihood based criteria under scenario 4 (Table 7), although AIC
overestimates k in 15% of cases; under scenario 5, instead, results are again not very
satisfactory, because BIC and CAIC select the right number of states in only 76% and 52%
of cases, respectively. The behavior of the classification based criteria is definitely
disappointing under both scenarios, with both r = 3 and r = 5.

Table 5: Relative frequencies of k chosen on the basis of several criteria (scenario 2, n = 500)

k      BIC    AIC    AIC3   CAIC   NEC    NEC1   NEC2   CLC    ICL-BIC
r = 1
1      0.05   0.00   0.01   0.08   1.00   1.00   1.00   1.00   1.00
2      0.95   0.99   0.99   0.92   0.00   0.00   0.00   0.00   0.00
3      0.00   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r = 3
1      0.00   0.00   0.00   0.00   0.97   0.98   0.00   0.97   0.98
2      1.00   0.74   0.99   1.00   0.03   0.02   1.00   0.03   0.02
3      0.00   0.23   0.01   0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.02   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r = 5
1      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
2      1.00   0.74   0.98   1.00   1.00   1.00   1.00   1.00   1.00
3      0.00   0.17   0.02   0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.08   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00

Table 6: Relative frequencies of k chosen on the basis of several criteria (scenario 3, n = 500)

k      BIC    AIC    AIC3   CAIC   NEC    NEC1   NEC2   CLC    ICL-BIC
r = 1
1      0.01   0.00   0.00   0.03   1.00   1.00   1.00   1.00   1.00
2      0.99   1.00   1.00   0.97   0.00   0.00   0.00   0.00   0.00
3      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r = 3
1      0.00   0.00   0.00   0.00   1.00   1.00   0.04   1.00   1.00
2      1.00   0.86   0.97   1.00   0.00   0.00   0.96   0.00   0.00
3      0.00   0.12   0.03   0.00   0.00   0.00   0.000  0.00   0.00
4      0.00   0.01   0.00   0.00   0.00   0.00   0.000  0.00   0.00
5      0.00   0.01   0.00   0.00   0.00   0.00   0.000  0.00   0.00
r = 5
1      0.00   0.00   0.00   0.00   0.32   0.90   0.00   0.32   0.57
2      1.00   0.67   1.00   1.00   0.63   0.10   1.00   0.63   0.43
3      0.00   0.27   0.00   0.00   0.01   0.00   0.00   0.01   0.00
4      0.00   0.04   0.00   0.00   0.03   0.00   0.00   0.03   0.00
5      0.00   0.02   0.00   0.00   0.01   0.00   0.00   0.01   0.00

Table 7: Relative frequencies of k chosen on the basis of several criteria (scenario 4)

k      BIC    AIC    AIC3   CAIC   NEC    NEC1   NEC2   CLC    ICL-BIC
r = 1
1      0.00   0.00   0.00   0.00   1.00   1.00   0.00   1.00   1.00
2      1.00   0.93   0.99   1.00   0.00   0.00   1.00   0.00   0.00
3      0.00   0.07   0.01   0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r = 3
1      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
2      0.41   0.01   0.02   0.62   1.00   1.00   1.00   1.00   1.00
3      0.59   0.91   0.98   0.39   0.00   0.00   0.00   0.00   0.00
4      0.00   0.07   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r = 5
1      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
2      0.00   0.00   0.00   0.00   1.00   1.00   1.00   1.00   1.00
3      1.00   0.85   0.99   1.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.13   0.01   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.02   0.00   0.00   0.00   0.00   0.00   0.00   0.00

Table 8: Relative frequencies of k chosen on the basis of several criteria (scenario 5)

k      BIC    AIC    AIC3   CAIC   NEC    NEC1   NEC2   CLC    ICL-BIC
r = 1
1      0.00   0.00   0.00   0.00   1.00   1.00   1.00   1.00   1.00
2      1.00   0.97   1.00   1.00   0.00   0.00   0.00   0.00   0.00
3      0.00   0.03   0.00   0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r = 3
1      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
2      0.99   0.15   0.65   1.00   1.00   1.00   1.00   1.00   1.00
3      0.01   0.75   0.35   0.00   0.00   0.00   0.00   0.00   0.00
4      0.00   0.10   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
r = 5
1      0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00
2      0.24   0.00   0.01   0.48   1.00   1.00   1.00   1.00   1.00
3      0.76   0.86   0.99   0.52   0.00   0.00   0.00   0.00   0.00
4      0.00   0.13   0.00   0.00   0.00   0.00   0.00   0.00   0.00
5      0.00   0.01   0.00   0.00   0.00   0.00   0.00   0.00   0.00

5 Conclusions

In this paper we investigated a typical issue characterizing some latent variable
models, such as latent class models or hidden Markov (HM) models, namely the choice of
the number of mixture components (i.e., latent classes or latent states). More
precisely, we focused on the selection of the number of latent states in univariate and
multivariate latent Markov (LM) models for longitudinal data. We first illustrated the
assumptions and the structure of the LM model, giving some hints about the maximization
of the log-likelihood by means of an EM algorithm. Then, we described some of the
best-known model selection criteria used in the context of mixture models, distinguishing
between log-likelihood based criteria (i.e., AIC, AIC3, CAIC, BIC) and classification
based criteria (NEC, CLC, ICL-BIC). Concerning the latter type of criteria, we gave some
emphasis to the problem of properly defining the entropy in the case of LM models.
Relying on the case of HM models, we observed that the exact entropy can be computed only
in the presence of a reduced number of time occasions and latent states. Therefore, we
also proposed two variants (named NEC1 and NEC2) that can be easily computed for any
number of time occasions and latent states.
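
As an illustration of the log-likelihood based criteria recalled above, the following minimal sketch computes AIC, AIC3, BIC and CAIC in their standard penalized forms and picks the value of k that minimizes each index. The function names are hypothetical, the parameter count assumes a time-homogeneous LM model with r responses having c categories each and response-specific conditional probabilities, and the log-likelihood values are placeholders invented for the example, not results of the study.

```python
import math


def n_free_parameters(k, r, c=2):
    """Free parameters under the assumed parameterization:
    (k - 1) initial probabilities, k * (k - 1) transition probabilities
    and k * r * (c - 1) conditional response probabilities."""
    return (k - 1) + k * (k - 1) + k * r * (c - 1)


def information_criteria(loglik, n_par, n):
    """Penalized log-likelihood indices; smaller values are preferred."""
    return {
        "AIC": -2 * loglik + 2 * n_par,
        "AIC3": -2 * loglik + 3 * n_par,
        "BIC": -2 * loglik + n_par * math.log(n),
        "CAIC": -2 * loglik + n_par * (math.log(n) + 1),
    }


def select_k(logliks, r, n):
    """logliks maps each candidate k to its maximized log-likelihood."""
    table = {k: information_criteria(ll, n_free_parameters(k, r), n)
             for k, ll in logliks.items()}
    return {name: min(table, key=lambda k: table[k][name])
            for name in ("AIC", "AIC3", "BIC", "CAIC")}


# Placeholder log-likelihoods, invented for illustration only
logliks = {1: -1700.0, 2: -1650.0, 3: -1643.0}
print(select_k(logliks, r=1, n=500))
```

With these placeholder values AIC selects k = 3 whereas AIC3, BIC and CAIC select k = 2, which shows how a lighter penalty favours larger models.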

On the basis of some Monte Carlo simulations, we compared the performance of
log-likelihood and classification based criteria for the selection of the number of
latent states in univariate and multivariate LM models. Generally speaking, lower
probabilities of persistence in the same state and/or a greater uncertainty in the
allocation of the observations to the latent states complicate the task of the selection
procedures, leading to a generally worse performance of the adopted criteria. Instead,
the number of observations does not play a relevant role.

Concerning the specific criteria, we observed that those based on the log-likelihood
generally behave better than those based on classification, even if AIC tends to
overestimate the correct number of latent states, especially in multivariate cases. The
classification based criteria tend to underestimate the true number of latent states,
mainly in the univariate case, whereas their performance improves as the number of
observed response variables increases. We also observed a significantly better behavior
of NEC2 with respect to the other classification based criteria. Finally, by increasing
the number of latent states the performance of all the considered criteria gets worse,
mainly in the univariate case. We conclude by noting that the results we obtained are
consistent with those observed in the literature about HM models (see, among others,
Costa and De Angelis, 2010).
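
For reference, the classification based indices are usually written in terms of the entropy of the posterior probabilities. The sketch below uses the standard mixture-model forms (NEC as the entropy divided by the log-likelihood gain over the one-component model, CLC and ICL-BIC as entropy-penalized versions of -2 log-likelihood and of BIC). It does not reproduce the LM-specific entropy definitions or the NEC1 and NEC2 variants, and the function names and posterior values are hypothetical.

```python
import math


def entropy(posteriors):
    """EN = -sum_i sum_u p_iu * log(p_iu), taking 0 * log(0) as 0."""
    return -sum(p * math.log(p) for row in posteriors for p in row if p > 0)


def classification_criteria(loglik_k, loglik_1, posteriors, n_par, n):
    """Standard mixture-model forms; NEC requires loglik_k > loglik_1."""
    en = entropy(posteriors)
    return {
        "NEC": en / (loglik_k - loglik_1),
        "CLC": -2 * loglik_k + 2 * en,
        "ICL-BIC": -2 * loglik_k + 2 * en + n_par * math.log(n),
    }


# Hypothetical posterior probabilities for four units and k = 2 states
sharp = [[0.99, 0.01], [0.02, 0.98], [0.97, 0.03], [0.01, 0.99]]
fuzzy = [[0.60, 0.40], [0.45, 0.55], [0.55, 0.45], [0.40, 0.60]]

# A sharper allocation gives a smaller entropy and hence a smaller penalty
print(entropy(sharp), entropy(fuzzy))
```

Since the entropy is zero when k = 1 and grows as the allocation becomes more uncertain, indices of this kind penalize fuzzy classifications, which is in line with the tendency to select fewer states noted above.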

Concerning further developments of the present work, we intend to extend the
simulation study to LM models with covariates. We also intend to rely on the most recent
advances in the context of HM models for a different formulation of the entropy that takes
into account the tendency of the traditional formulation to overestimate the uncertainty
in this type of model (Durand and Guedon, 2012).